(54) Method for reducing the number of coherency cycles within a directory-based cache
coherency memory system utilizing a memory state cache 



(57) The present invention relates to a method for
replacing entries within a state cache memory of a
multiprocessor computer system. The computer system
has, in addition to the state cache memory, a shared
system memory, a plurality of data cache memories, and a
system of busses interconnecting the system memory
with the data cache memories, and employs a
centralised/distributed directory based cache coherency
scheme for maintaining consistency between lines of
memory within said shared system memory and the
data cache memories.

The method establishes a default memory state of
SHARED for lines of memory represented in the state
cache memory. The system memory line state for a
state cache entry associated with a line of memory
stored in the shared memory and at least one data
cache memory is read prior to its replacement. A
castout operation updates the line of memory within the
shared memory and assigns a data cache memory line
state of SHARED to the line of memory in each data
cache memory if the system memory line state is
OWNED.

Description

The present invention relates to multiprocessor computer systems having multiple data cache memories and a
shared memory and, more particularly, to multiprocessor computer systems employing directory-based protocols for
maintaining cache coherency.

The past several years have seen near exponential increases in the performance, speed, integration density, and
capacity of computer systems. These improvements, coupled with the decrease in costs for computer systems, have
resulted in more expansive utilization of computer systems and the development of more sophisticated and resource
intensive computer applications. According to recent historical trends, application memory requirements double yearly.
Although the costs for computer systems and components have steadily declined in recent years, high speed RAM
memory utilized in system main memory and cache memories remains one of the highest cost components within most
computer systems.

System and cache memories, used primarily for the temporary storage of data, application software and operating
system software, are also being utilized within more sophisticated multiprocessor systems for the storage of parity bits,
cache coherency state information, and error detection and/or correction syndrome bits. These additional memory
requirements of multiprocessor systems, and the higher memory demands of advanced operating systems and
applications, result in an increased demand, and cost, for high speed RAM.

More efficient methods for utilizing high speed system and cache memory, and for reducing system and cache 
memory requirements, are desired. 

It is therefore an object of the present invention to provide a new and useful method for improving memory
utilization within a computer system employing directory-based cache coherency.

According to the invention there is provided a method for replacing entries within a state cache memory of a
multiprocessor computer system which includes the state cache memory, a shared system memory, a plurality of data
cache memories, and a system of busses interconnecting the system memory with said plurality of data cache memories,
the computer system employing a centralised/distributed directory based cache coherency scheme for maintaining
consistency between lines of memory within said shared system memory and said plurality of data cache memories, the
method comprising the steps of:

establishing a default memory state of SHARED for lines of memory represented in the state cache memory;
reading, prior to its replacement, the system memory line state for a state cache entry associated with a line of
memory stored in the shared memory and at least one data cache memory; and

performing a castout operation to update the line of memory within said shared memory and assigning a data
cache memory line state of SHARED to said line of memory in each data cache memory containing the line of
memory if said system memory line state is OWNED.


A multiprocessor computer system to which the present invention relates is described in European Patent Applica- 
tion Number . filed concurrently with the present application. 

Information concerning the state of a line of memory is maintained within the state cache memory and in the data
cache memories. Each data cache memory contains a data cache memory line state with each line of memory saved
within the data cache memory, the data cache memory line state being any one of the MESI
(Modified-Exclusive-Shared-Invalid) states: MODIFIED, EXCLUSIVE, SHARED, or INVALID. The state cache memory contains
a system memory line state for a predetermined number of lines of memory saved within the system memory, the system
memory line state being any one of the following states: SHARED BUS A, SHARED BUS B, SHARED BOTH, OWNED BUS
A and OWNED BUS B.

The method for performing state cache line replacement operations includes the following steps: establishing a
default system memory line state of SHARED for lines of memory represented in said state cache memory; reading the
system memory line state for a previously stored state cache entry prior to a replacement of said previously stored state
cache entry, said previously stored state cache entry being associated with a line of memory stored in said shared
memory and at least one data cache memory; and performing a castout operation to update the line of memory within
said shared memory and assigning a data cache memory line state of SHARED to said line of memory in each data
cache memory containing said line of memory if said system memory line state for said previously stored state cache
entry is OWNED (OWNED BUS A or OWNED BUS B).

The described method reduces the number of coherency operations caused as a result of replacements in the state
cache memory. Since most lines of memory are in a shared state, setting the default state to a shared state, rather than
an uncached state, reduces the number of invalidate coherency operations which must be performed during state cache
line replacements.

The invention will now be described by way of example with reference to the accompanying drawings in which:-

Figure 1 is a simple block diagram representation of an eight-processor super high volume (SHV) symmetric
multiprocessing (SMP) computer system employing currently available commodity components.

Figure 2 is a block diagram representation of system memory 105A and a cache memory for the storage of state
information.

Figure 3 is a block diagram representation of state cache memory 203 of Figure 2 providing more detail concerning
the structure and operation of state cache 203.

Figure 4 is a table illustrating reductions in replacement memory line coherency operations for three possible
default memory line states in accordance with the present invention.

Figures 5A and 5B provide a coherency state table for a three bit directory based memory having an "Uncached"
default state in accordance with a traditional replacement procedure.

Figures 6A and 6B provide a coherency state table for a three bit directory based memory having a "Shared Both"
default state in accordance with a first embodiment of the present invention.

Figures 7A and 7B provide a coherency state table for a three bit directory based memory having a "Shared Agent
A" default state in accordance with a second embodiment of the present invention.

NCR Corporation has developed an advanced multiprocessor architecture utilizing system techniques pioneered
by NCR while also advantageously making use of standard high volume (SHV) components, such as Intel Pentium Pro
processors, PCI I/O chipsets, Pentium Pro chipsets, Pentium Pro bus topology (P6), and standard memory modules
(SIMMs and DIMMs). Through careful integration of NCR system techniques with standard SHV components, NCR is
able to deliver world class scalability and feature content while still capitalizing on SHV and without the disadvantages
associated with full custom development. One implementation of this architecture is shown in Figure 1.

System Overview

Referring now to Figure 1, there is seen an eight-processor SMP system formed of two four-processor building
blocks or complexes, identified by reference numerals A and B. Each complex is seen to include identical structure and
components, which are identified by reference numerals ending in either an A or a B, for complex "A" and "B",
respectively.

The portion of the system contained in complex A is seen to include up to four processors 101A connected to a
high-bandwidth split-transaction processor bus 103A. Associated with each processor 101A is a cache memory 321A.
A system memory 105A is connected to bus 103A through an advanced dual-ported memory controller 107A. The
processor bus 103A is connected to the first port of memory controller 107A. The second memory controller port connects
to a high bandwidth I/O bus 115, also referred to herein as an expansion bus, which provides connection for multiple
PCI I/O interfaces 109A. All of these components, with the exception of advanced memory controller 107A, are
currently available commodity components. For example, processors 101A may be Intel Pentium Pro processors and
busses 103A and 115 may be Pentium Pro (P6) bus topology.

The advanced memory controller (AMC) 107A manages control and data flow in all directions between processor
bus 103A and I/O bus 115. The I/O bus may contain P6 to PCI I/O Bridges and another AMC ASIC for connectivity to
another processor bus, as will be discussed below. The AMC 107A also controls access to a coherent DRAM memory
array. The AMC as presently implemented consists of a control and data slice ASIC pair.

As stated earlier, complex B has a construction identical to complex A. The two complexes are interconnected by
expansion bus 115, allowing for communication between the processors 101A and 101B, system memories 105A and
105B, as well as shared I/O devices, cache memories, and other components.

Within each complex, the processors use a bus snooping protocol on the processor bus. Bus snooping is a method
of keeping track of data movements between processors and memory. There are performance advantages to this
system with a small number of tightly-coupled processors. If a processor needs data that is available in the data cache
of another processor on the same bus, the data can be shared by both processors. Otherwise, the data must be retrieved
from main memory 105A or 105B, a more time consuming operation which requires system bus traffic. This method
enhances system performance by reducing system bus contention.
The characteristics of the NCR architecture shown in Figure 1 include:

• Capitalizes on industry SHV architecture and supporting commodity chips (IOB, etc.)

• Dual ported memory controllers 107A and 107B permit connection and utilization of dual buses, each operating at
66 MHz with a bandwidth of 64 bits and capable of sustained data transfer rates of 533 MB/s.

• Dual bus approach provides greater scalability through a reduction of bus loadings and provision of a private
processor to memory path that can operate independent of IOB to IOB traffic.

• Additional processors and I/O devices can be connected to the expansion bus 115.

The system as described is able to fill High Availability Transaction Processing (HATP) and Scaleable Data
Warehouse (SDW) server needs, while capitalizing on the computer industry's SHV motion.

Memory-Based Coherency 


In any system employing a data cache memory, and particularly a system employing multiple data cache memories
and multiple levels of data cache memories, data from a given memory location can reside simultaneously in main
memory and in one or more data cache memories. However, the data in main memory and in data cache memory may
not always be the same. This may occur when a microprocessor updates the data contained in its associated data
cache memory without updating the main memory and other data cache memories, or when another bus master
changes data in main memory without updating its copy in the microprocessor data cache memories.

To track the data moving between the processors, system memory modules 105A and 105B, and the various data
cache memories, the system utilizes a hybrid of memory and cache based coherency. Coherency between system
memory and caching agents, i.e., system bus processors with first and possibly second level data caches, is maintained
via a combination centralized/distributed directory-based cache coherency.

A directory-based cache coherency scheme is a method of keeping track of data movements between the
processors and memory. With this approach to data coherency, a memory status table identifies which processors have
which lines of memory in their associated data cache memories. When a processor requests data, the status table identifies
the location within main memory or processor data cache where the most current copy of the data resides. The
advantage of this method is that no additional work must be performed until a processor needs data that resides in a data
cache that cannot be accessed through snooping. Directory-based cache coherency is most effective with a large
number of tightly-coupled processors on a system bus.

The centralized/distributed directory-based cache coherency scheme employed in the system shown in Figure 1
consists of two directory elements. The central element within the directory scheme resides in state cache memories
203A and 203B associated with system memory modules 105A and 105B, respectively. This element is referred to as
the Memory Line Status Table (MLST). Each active (cached) memory line within system memory includes a
corresponding entry in the MLST. This corresponding entry contains information indicating whether or not a memory line is
cached, and if so, whether it is exclusively owned by one processor (or bus), or shared across multiple processors (or
buses). The directory scheme and MLST can be set up to identify memory line ownership by system bus or by
processor. The "bit-per-bus" MLST distinguishes ownership on a bus basis, while the more granular "bit-per-processor" MLST
distinguishes ownership on a processor basis. Note that the distinction is specific to a memory design and hence
transparent to any other device on the system bus.

Distributed directory elements reside locally within each processor's data cache directory. The element associated
with a particular processor is referred to as its Processor Line Status Table (PLST). Each cache line has a
corresponding entry in the PLST. From the local processor's perspective, this entry contains information indicating whether or not
a line contains a valid copy of a main memory line, and if so, whether or not modifications to that line must be broadcast
to the rest of the system. From the system's perspective, each processor's PLST is a slave to special system bus cycles
known as Memory Intervention Commands (MICs). These cycles query the PLST as to the local state of a particular
line, and/or tell the PLST to change that local state.


Memory and Cache State Definitions 

The Modified-Exclusive-Shared-Invalid (MESI) cache coherency protocol is a hardware-implemented protocol for
maintaining data consistency between main memory and data cache memories. A typical implementation of the MESI
hardware cache coherency protocol requires the utilization of cache controllers having the ability to:

1. use the same line size for all caches on the memory bus;

2. observe all activity on the memory bus;

3. maintain state information for every line of cache memory; and

4. take appropriate action to maintain data consistency within the cache memories and main memory.

MESI represents four states which define whether a line is valid, if it is available in other caches, and if it has been
modified. Each line of data in a data cache includes an associated field which indicates whether the line of data is
MODIFIED, EXCLUSIVE, SHARED, or INVALID. Within the Processor Line Status Table each cache line is marked in one of
the four possible MESI states:

• MODIFIED (PM) - This state indicates a line of data which is exclusively available in only this cache, and is
modified. Modified data has been acted upon by a processor. A Modified line can be updated locally in the cache without
acquiring the shared memory bus. If some other device in the system requires this line, the owning cache must
supply the data.

• EXCLUSIVE (PE) - This state indicates a line of data which is exclusively available in only this cache, that this line
is not Modified (main memory also has a valid copy), and that the local processor has the freedom to modify this
line without informing the system. Exclusive data can not be used by any other processor until it is acted upon in
some manner. Writing to an Exclusive line causes it to change to the Modified state and can be done without
informing other caches, so no memory bus activity is generated. Note that lines in the (PE) state will be marked
(MO) in the MLST, as will be described below.

• SHARED (PS) - This state indicates a line of data which is potentially shared with other caches (the same line may
exist in one or more caches). Shared data may be shared among multiple processors and stored in multiple
caches. A Shared line can be read by the local processor without a main memory access. When a processor writes
to a line locally marked shared, it must broadcast the write to the system as well.

• INVALID (PI) - This state indicates a line of data is not available in the cache. Invalid data in a particular cache is
not to be used for future processing, except diagnostic or similar uses. A read to this line will be a "miss" (not
available). A write to this line will cause a write-through cycle to the memory bus. All cache lines are reset to the (PI)
state upon system initialization.

In accordance with the MESI protocol, when a processor owns a line of memory, whether modified or exclusive,
any writes to the owned line of memory within main memory will result in an immediate update of the same data
contained within the processor's data cache memory.

The Memory Line Status Table marks a memory line in one of three possible states: NOT CACHED (MNC),
SHARED (MS), and OWNED (MO). The letter M distinguishes these states from PLST states, which are identified by
use of the letter P. Additionally, there are bus and/or processor state bits indicating sharing or ownership on either a bus
or processor basis.


• NOT CACHED (MNC): Indicates that no cache has a copy of that line. All memory lines must be reset to the (MNC)
state upon system initialization.

• SHARED STATE (MS): Indicates that one or more caches potentially have a copy of that line.

• OWNED STATE (MO): Indicates that one and only one cache potentially has a copy of that line, and that the data
in memory potentially does not match it (memory data is referred to as stale).

Note the word "potentially" used in the definition of the shared and owned states. There are several situations in
which the MLST does not have the most up-to-date information about a particular memory line. For example, the MLST
may mark a line as shared by two particular processors since it saw them both read it. However, both processors may
have long since discarded that line to make room for new data without informing the MLST (referred to as "silent
replacement"). The MLST will naturally "catch up" to the latest state of a particular line whenever an access to that line
by some master forces a MIC. In this example, a write by a third processor to this line will initiate a (now superfluous)
MIC to invalidate other cached copies, and will bring the MLST up-to-date. Note however that the MLST always holds
a conservative view of the state of cache lines. That is, a line that is owned or shared by a processor will always be
marked correctly in the MLST. "Stale" information in the MLST takes the form of lines marked owned or shared that are
no longer present in any processor's data cache.

There are three distinct MIC operations employed within the system described above in order to maintain
coherency between system memory and the data cache memories:

• INVALIDATE (MBI) This cycle is initiated to cause all data caches with an addressed line to go to the invalid state
(PI). It normally occurs in response to certain memory operations that access a line marked shared (MS) in the
MLST. Unlike the other MIC operations, an MBI does not require feedback from any PLST as to the current state of
the addressed line in a local cache. Rather, the MBI simply requests invalidation of a line if it is present in the cache.
Although an MBI requires no logical feedback, it does require a positive acknowledgment from the targeted
processor(s) to complete the cycle. This simply indicates that the processor has accepted the invalidate address and is
ready for another.

• CASTOUT INVALIDATE (MBCOI) This cycle is initiated to cause a cache with a potentially modified copy of an
addressed line to cast it out to system memory and to go to the invalid state (PI). It occurs in response to certain
memory operations that access a memory line marked owned (MO) in the MLST. If the owning cache has the line
in the modified (PM) state, it supplies the data and goes invalid. If the owning cache has the line in the exclusive
(PE) state, it acknowledges the MBCOI and goes invalid, but does not supply the data. If the owning cache no
longer has the line it simply acknowledges the MBCOI to complete the cycle.

• CASTOUT SHARED (MBCOS) This cycle is to cause a cache with a potentially modified copy of an addressed line
to cast it out to system memory and to go to the shared state (PS). It occurs in response to certain memory
operations that access a memory line marked owned (MO) in the MLST. If the owning cache has the line in the modified
(PM) state, it supplies the data and goes to shared. If the owning cache has the line in the exclusive (PE) state, it
acknowledges the MBCOS and goes to shared, but does not supply the data. If the owning cache no longer has
the line it acknowledges the MBCOS to complete the cycle. Note that in the last case the MLST goes to shared
(MS) even though the line is not cached. This is because the MLST cannot distinguish a line that is exclusive (PE)
in the owner's cache from a line that is invalid (PI).
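
The responses of a single data cache to the three MIC operations can be summarised in a short sketch. This is only an illustrative model of the behaviour described above, not part of the described hardware; the names `PLST` and `respond_to_mic` are hypothetical, and the treatment of a castout received by a cache holding the line in the shared (PS) state is an assumption, since castouts are only expected for lines marked owned.

```python
from enum import Enum


class PLST(Enum):
    """Processor-side MESI states, named as in the text."""
    PM = "modified"
    PE = "exclusive"
    PS = "shared"
    PI = "invalid"


def respond_to_mic(mic, state):
    """Return (new_state, supplies_data) for one cache receiving a MIC.

    mic is one of "MBI", "MBCOI", "MBCOS". Every case also implies an
    acknowledgment to complete the cycle, which is not modelled here.
    """
    if mic == "MBI":
        # Invalidate: no feedback on the local state is needed.
        return PLST.PI, False
    if mic not in ("MBCOI", "MBCOS"):
        raise ValueError(f"unknown MIC operation: {mic}")
    if state is PLST.PI:
        # Line was silently replaced: acknowledge only.
        return PLST.PI, False
    # Only a modified line has data that memory lacks.
    supplies_data = state is PLST.PM
    # Assumption: a PS line is treated like PE (no data supplied).
    new_state = PLST.PI if mic == "MBCOI" else PLST.PS
    return new_state, supplies_data
```

For example, `respond_to_mic("MBCOS", PLST.PM)` yields the shared state with data supplied, while the same command to a cache that no longer holds the line yields only an acknowledgment.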

As stated above, the MLST includes additional bus and/or processor state bits indicating sharing or ownership on
either a bus or processor basis.

The Bit-per-Bus Protocol uses three memory state bits per line to indicate the current state of the line. One bit
indicates shared or owned, and the other two depict which bus (A or B) or buses (A and B) have the line shared or owned.
Bus ownership indicates that one of the processors on that bus owns the line. Six states are possible: UNCACHED,
SHARED BUS A, SHARED BUS B, SHARED BOTH, OWNED BUS A and OWNED BUS B. Note that a line can be
owned by only one processor and therefore by only one bus. A shared line can be shared by one or more processors
on each bus.

Memory State Bits for Bit-per-Bus Protocol

OBA    STATE BIT DEFINITIONS      DESCRIPTION
000    MNC - Not Cached           Not owned or shared
001    MS - Shared                Shared on Bus A
010    MS - Shared                Shared on Bus B
011    MS - Shared                Shared on Buses A and B
100    X - (not a valid state)
101    MO - Owned                 Owned by Bus A
110    MO - Owned                 Owned by Bus B
111    X - (not a valid state)
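
The three-bit encoding in the table above can be decoded with a simple lookup. The following is a minimal sketch for illustration only; the names `BUS_STATES` and `decode_bus_state` are hypothetical, and the bit order (owned bit first, then bus B, then bus A) is read off the table rows rather than stated explicitly in the text.

```python
# Valid bit-per-bus encodings: bits are (owned, bus B, bus A).
BUS_STATES = {
    0b000: ("MNC", "Not owned or shared"),
    0b001: ("MS", "Shared on Bus A"),
    0b010: ("MS", "Shared on Bus B"),
    0b011: ("MS", "Shared on Buses A and B"),
    0b101: ("MO", "Owned by Bus A"),
    0b110: ("MO", "Owned by Bus B"),
}


def decode_bus_state(bits):
    """Return (state, description) for a 3-bit MLST value.

    Raises ValueError for 0b100 and 0b111, which the table marks as
    not valid (owned by no bus, or owned by both buses).
    """
    if bits not in BUS_STATES:
        raise ValueError(f"not a valid state: {bits:03b}")
    return BUS_STATES[bits]
```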





The Bit-per-Processor Protocol has an MLST consisting of n+1 bits per line (n is equal to the number of processors)
to indicate the current state of that line. One bit indicates whether the line is shared (MS) or owned (MO), and the other
n bits depict which processor or processors have the line cached. A particular processor is numbered Pi, where i = 0 to
n-1. All Pi, where i is even, are on bus A, and all Pi, where i is odd, are on bus B. Processor ownership indicates which
processor (only one) owns the line. A shared line can be shared by one or more processors on either or both buses.

Memory State Bits for Bit-per-Processor Protocol

O    P0..Pn-1             STATE BIT DEFINITIONS
0    all zeros            MNC - Not Cached
0    one or more set      MS - Shared
1    only one set         MO - Owned
1    more than one set    X - (not a valid state)
1    all zeros            X - (not a valid state)
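
The classification rules of the bit-per-processor table can be expressed as a small function. This is a sketch under stated assumptions: the processor bits are packed into an integer with bit i standing for processor Pi, and the names `classify_line` and `bus_of` are hypothetical.

```python
def classify_line(owned_bit, presence):
    """Classify an MLST entry in the bit-per-processor protocol.

    owned_bit: 0 for shared, 1 for owned.
    presence:  integer whose bit i is set when processor Pi has the line.
    Returns "MNC", "MS", "MO", or "INVALID" (not a valid state).
    """
    holders = bin(presence).count("1")
    if owned_bit == 0:
        return "MNC" if holders == 0 else "MS"
    # An owned line must have exactly one holder.
    return "MO" if holders == 1 else "INVALID"


def bus_of(i):
    """Even-numbered processors are on bus A, odd-numbered on bus B."""
    return "A" if i % 2 == 0 else "B"
```

With four processors, `classify_line(0, 0b0101)` reports a line shared by P0 and P2, both of which `bus_of` places on bus A.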






EP 0 847 011 A2 






Memory State Cache

As described earlier, the MLST containing state information associated with system memory 105A or 105B is
maintained within state cache memories 203A and 203B, respectively. The state cache memories are sized to store state
information for only a portion of the memory lines included in system memory in recognition that rarely will all of system
memory be utilized (cached) at any one time. The structure and operation of an exemplary state cache memory 203A
is illustrated in Figures 2 and 3.

Figure 2 shows system memory 105A having a size, for example, of one gigabyte (2^30 bytes) divided into
33,554,432 (2^25) memory blocks or lines, each line having a size of 32 bytes. Data stored within memory 105A
is accessed by submitting a 30-bit address 201. The 25 most significant bits within the address, identified as "X" bits,
identify the memory block or line number. The next 4 address bits, identified as "W" bits, point to the word within the
memory block, while the least significant address bit "B" identifies the byte within a word.
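
The address decomposition described above can be illustrated with a few shifts and masks. This sketch uses only the field widths given in the text (25 line bits, 4 word bits, 1 byte bit); the function name `split_address` and the constant names are hypothetical.

```python
LINE_BITS = 25  # "X" bits: memory block (line) number
WORD_BITS = 4   # "W" bits: word within the line
BYTE_BITS = 1   # "B" bit: byte within the word


def split_address(addr):
    """Split a byte address into (line, word, byte) fields."""
    total_bits = LINE_BITS + WORD_BITS + BYTE_BITS
    if not 0 <= addr < (1 << total_bits):
        raise ValueError("address out of range")
    byte = addr & ((1 << BYTE_BITS) - 1)
    word = (addr >> BYTE_BITS) & ((1 << WORD_BITS) - 1)
    line = addr >> (BYTE_BITS + WORD_BITS)
    return line, word, byte
```

Because the word and byte fields together occupy five bits, consecutive lines start 32 bytes apart, which matches the 32-byte line size.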

The cache memory 203A providing for the storage of state information includes two sections identified as the
"cache tag RAM" 205A and the "cache data RAM" 207A. Each line entry within state cache memory 203A contains
state information saved to the cache data RAM and a four bit tag stored within the cache tag RAM.

State cache memory 203A is indexed by a subset of the total number of memory address bits. The remainder of
the address bits, or tag bits, are part of the contents for each entry in the storage device. The index bits define the
number of entries in the state cache, and the tag bits determine the number of memory lines which can contend for the
same entry in the state cache. The index bits plus the tag bits define the total number of memory lines which can be
supported. In essence, the reduction in state storage is defined by the number of tag bits. For example, if the number
of tag bits is four, then the state storage requirements for this concept are one sixteenth that of the traditional
architecture.

More detailed information concerning the structure and operation of the state cache 203A is shown in Figure 3.
Entries within the state cache are accessed by submitting the same address 201 used to access main memory 105A.
The four most significant bits within the address are identified as tag bits, and the next 21 address bits are identified as
index bits. These 25 address bits are the same bits identified as X bits in Figure 2, and which are used to identify
memory blocks within main memory 105A.

During a state cache read operation, the index field of the address is used to specify the particular entry or line of
cache to be checked. Next the tag bits of the address are compared with the tag of the selected cache line. If there is
a match, a cache hit occurs and the state bits associated with the selected cache line are retrieved.

To store state information within the state cache memory, the index field of an address is used to identify a
particular entry or line of cache for tag and state information storage. The first four address bits are saved to the cache tag
RAM while the state information associated with the address is saved to the cache data RAM. Optionally, error
detection and/or correction syndrome bits or parity bits could be included in the state cache memory.
The state cache, as described, is a direct mapped cache. Note, however, that the state cache can be associative,
sectored, or direct mapped as with data caches.
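
The read and store operations of the direct mapped state cache can be sketched as follows. This is an illustrative model only: a Python dictionary stands in for the cache tag RAM and cache data RAM, the class and function names are hypothetical, and returning the displaced entry from `store` is an assumption made so that a caller could perform any replacement coherency operations.

```python
INDEX_BITS = 21  # selects one of 2**21 state cache entries
TAG_BITS = 4     # distinguishes the 16 lines contending for an entry


def split_line(line):
    """Split a 25-bit memory line number into (tag, index)."""
    return line >> INDEX_BITS, line & ((1 << INDEX_BITS) - 1)


class StateCache:
    def __init__(self):
        # index -> (tag, state); stands in for the tag and data RAMs
        self.entries = {}

    def read(self, line):
        """Return the stored state on a tag match, or None on a miss."""
        tag, index = split_line(line)
        entry = self.entries.get(index)
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def store(self, line, state):
        """Store state for line; return the (line, state) displaced, if any."""
        tag, index = split_line(line)
        old = self.entries.get(index)
        self.entries[index] = (tag, state)
        if old is not None and old[0] != tag:
            return (old[0] << INDEX_BITS) | index, old[1]
        return None
```

Two line numbers that differ only in their four tag bits map to the same entry, so storing one displaces the other.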

The operation of the memory system is as follows: When a read or write operation is requested of the system
memory, the state cache is accessed to determine the coherency cycles necessary, dependent on the protocol. If the tag
information in the state cache matches the corresponding bits of the memory address, then the corresponding coherency
cycles are performed and the state updated. If there is not a tag match, then coherency operations for the default state
are performed (possibly none), and the new line address and state are allocated to the state cache. Possibly an existing
entry will be replaced by the new line. Coherency operations may be required to bring the replaced line state to the
default state. These replacement coherency operations are the performance cost for reducing the amount of state
storage, but as mentioned above are negligible for a reasonable state cache size and typical workload. Note that the state
cache can be associative, sectored, or direct mapped as with data caches.

The memory space saving provided through use of a state cache memory is illustrated in the following example.
Consider the system, described earlier, having one gigabyte of memory and a 4-bit coherency state field required per
line of memory. The basic coherency block or line of memory is 32 bytes. To store the 4 bit state for all of memory would
require 16 MB of state memory (32 million lines times 4 bits per line). If each entry in the state memory contains a 4 bit
tag, the state memory would contain 8 bits of information per line, which is double the traditional amount. However, only
one sixteenth as many lines are needed due to the 4 bit tag. Therefore, the total state memory required is 2 MB, which
is only one eighth of the traditional amount. The tradeoff is possible replacements of the state cache entries, which are
relatively few. In this example, the state cache is equivalent to a 64 MB data cache (2 million entries, each representing
one 32 byte line).
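
The arithmetic of this example can be checked with a few lines of code. The sketch below uses only the figures given in the text (one gigabyte of memory, 32 byte lines, a 4 bit state field and a 4 bit tag); the variable names are hypothetical, and "MB" is taken to mean 2^20 bytes.

```python
MEMORY_BYTES = 2**30  # one gigabyte of system memory
LINE_BYTES = 32       # basic coherency block
STATE_BITS = 4        # coherency state field per line
TAG_BITS = 4          # tag stored with each state cache entry

lines = MEMORY_BYTES // LINE_BYTES                # 2**25 lines
full_directory_bytes = lines * STATE_BITS // 8    # state for every line

entries = lines >> TAG_BITS                       # one sixteenth of the lines
state_cache_bytes = entries * (STATE_BITS + TAG_BITS) // 8
covered_bytes = entries * LINE_BYTES              # memory tracked at once

print(full_directory_bytes // 2**20, "MB traditional")
print(state_cache_bytes // 2**20, "MB state cache")
print(covered_bytes // 2**20, "MB of lines represented")
```

The ratio of the two storage figures is one eighth because each entry doubles in width while the entry count falls by a factor of sixteen.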

One disadvantage which arises from the utilization of a memory state cache as described above is that
additional coherency actions resulting from replacements of cached state entries may interfere with normal transfers.
The present invention provides a method for reducing the number of coherency operations caused as a result of
replacements in a directory based memory state cache.

Reduction of Replacement Operations

As stated earlier, when information about an address which is not stored in the state cache is needed, a previously
stored state cache entry must be replaced to allow allocation of the new address. The current state of the art protocols
restore the system state of the memory line (basic coherency element) to noncached (not shared or owned). This
involves invalidate and castout operations to the processor data caches.

In the directory based cache coherency system thus far described, and in many similar systems, it should be
recognized that most state cache lines are marked as shared after a warm-up period. It is thereby possible to reduce the
number of coherency operations resulting from replacements by defining a default memory state of shared. Thus the
replacement algorithm employed to handle operations resulting from replacements of cached state entries need only
ensure the shared state. Since most lines of cache are in the shared state, this means no invalidates, and the data
caches continue to retain the memory information in the shared state. If the replaced line is in an owned state, then a
castout operation will be generated, but again the line can transition to the shared state in the data caches.
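
The replacement behaviour described above can be sketched as a small data structure. The following Python sketch is illustrative only: the class name `StateCache`, the direct-mapped organization, and the string state names are assumptions made for the example and are not taken from the specification.

```python
class StateCache:
    """Direct-mapped state cache; a line with no entry is assumed to be shared."""

    def __init__(self, num_entries, default="shared"):
        self.num_entries = num_entries
        self.default = default
        self.entries = {}  # index -> (tag, state)

    def _split(self, line):
        # Low part of the line number selects the entry, high part is the tag.
        return line % self.num_entries, line // self.num_entries

    def lookup(self, line):
        index, tag = self._split(line)
        entry = self.entries.get(index)
        if entry is not None and entry[0] == tag:
            return entry[1]
        return self.default

    def allocate(self, line, state):
        """Install a state for `line`; return the operations for any displaced entry."""
        index, tag = self._split(line)
        ops = []
        old = self.entries.get(index)
        if old is not None and old[0] != tag:
            old_line = old[0] * self.num_entries + index
            if old[1] == "owned":
                # The owner writes the line back and keeps a shared copy.
                ops.append(("castout", old_line))
            # A displaced shared entry already matches the default: no invalidate.
        self.entries[index] = (tag, state)
        return ops


cache = StateCache(num_entries=4)
cache.allocate(1, "owned")
print(cache.allocate(5, "shared"))  # line 5 displaces owned line 1
print(cache.lookup(1))              # line 1 now falls back to the default
```

In this sketch a displaced shared entry produces no coherency operation at all, and a displaced owned entry produces only a castout, after which the line is treated as shared by default.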

The tradeoff for this method is that when a line is allocated to the state cache in an ownership state (caused by write
misses), invalidates will be necessary to all the possible sharers of that line in all the data caches. However, both write and
read misses cause allocates, which in turn cause replacements. If fewer than half of the misses are write cycles, then
more replacement invalidates will be saved than invalidates are caused by write cycles, since read cycles allocate
shared entries in the cache.
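
The break-even argument can be expressed as a simple count. The sketch below is a deliberately simplified model, not part of the specification: it assumes that every miss allocates one entry and eventually forces one replacement, that an entry allocated by a read miss is replaced while still shared (saving one invalidate), and that each write miss costs one additional invalidate at allocation. The function name is hypothetical.

```python
def net_invalidates_saved(read_misses, write_misses):
    """Invalidates saved minus invalidates added under the shared default.

    Simplifying assumptions (labelled, not from the source):
      * each read miss allocates a shared entry whose later replacement
        no longer needs an invalidate (one saved per read miss);
      * each write miss allocates an owned entry and needs one extra
        invalidate to possible sharers (one added per write miss).
    """
    saved = read_misses
    added = write_misses
    return saved - added


# With 30% write misses the method comes out ahead; at 50% it breaks even.
print(net_invalidates_saved(70, 30))
print(net_invalidates_saved(50, 50))
```

Under these assumptions the result is positive exactly when write misses make up fewer than half of all misses, which matches the condition stated above.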

As an example, consider a directory based memory system with 3 bits of state information per memory line
(coherency element) for two caching agents. Six of the eight possible states typically used are: uncached, shared agent A,
shared agent B, shared both, owned agent A, and owned agent B. Figure 4 provides a table showing the replaced line
operations for the traditional algorithm and the claimed method with a "Shared Both" or a "Shared Only" default state.
Note that a shared agent A or shared agent B default has merit if one agent has faster or higher priority access to memory;
thus invalidates would be quicker for allocates to the state cache.

For the "Shared Both" default state protocol, replacement actions are required only if the entry is in an ownership
state, and the agents are allowed to keep the lines in the shared state locally. For the "Shared A" default state protocol,
invalidates are required to agent B if the line is "Shared by Agent B" or "Shared by Both". However, fewer extra invalidates
are required while allocating to the state cache for this case. The table provided by Figures 5A and 5B shows the
coherency state table for a three-bit directory based memory with an uncached default state. The tables shown in Figures
6A-6B and 7A-7B show the coherency states for "Shared Both" and "Shared Agent A" default states. The states which
cause some additional invalidates are shown in bold font. For the "Shared Both" default case, a write miss will cause an
invalidate to the opposite agent and a DMA write will cause invalidates to both agents. For the "Shared A" default case,
a write miss by agent A will only cause a local coherency cycle, and a DMA write will only invalidate agent A. Therefore
fewer invalidates occur for this case. (Note: if the I/O interface is local to either agent then invalidates can be handled
locally for that agent, for a local snooping protocol, with the memory invalidating the opposite agent).

It can thus be seen that there has been provided by the present invention a new and useful method for improving
memory utilization within a computer system employing centralized/distributed directory-based cache coherency and a
state cache memory. The described method reduces the number of coherency (invalidate) operations caused as a
result of replacements in the state cache memory.

Claims 

1. A method for replacing entries within a state cache memory (303A, 303B) of a multiprocessor computer system
which includes the state cache memory (303A, 303B), a shared system memory (105A, 105B), a plurality of data
cache memories (121A, 121B), a system of busses (103A, 103B) interconnecting the system memory with said
plurality of data cache memories, the computer system employing a centralised/distributed directory based cache
coherency scheme for maintaining consistency between lines of memory within said shared system memory
(105A, 105B) and said plurality of data cache memories (121A, 121B), the method comprising the steps of:

establishing a default memory state of SHARED for lines of memory represented in the state cache memory
(303A, 303B),

reading, prior to its replacement, the system memory line state for a state cache entry associated with a line of
memory stored in the shared memory (105A, 105B) and at least one data cache memory (121A, 121B), and

performing a castout operation to update the line of memory within said shared memory (105A, 105B) and
assigning a data cache memory line state of SHARED to said line of memory in each data cache memory
containing the line of memory if said system memory line state is OWNED.

2. A method as claimed in claim 1 in which the multiprocessor system to which the method is applied comprises a
data cache memory (121A, 121B) associated with each processor (101A, 101B), each one of the data cache
memories (121A, 121B) containing a data cache memory line state with each line of memory saved within the data
cache memory, the data cache memory line state being any one of the group: MODIFIED, EXCLUSIVE, SHARED
or INVALID; the state cache memory (303A, 303B) containing a system memory line state for a predetermined
number of lines of memory saved within said system memory (105A, 105B), said system memory line state being
any one of the group: SHARED BUS A, SHARED BUS B, SHARED BOTH, OWNED BUS A and OWNED BUS B;
the system of busses including first (BUS A) and second (BUS B) memory busses, and each memory bus
connecting a subset of the processors (101A, 101B) and associated data cache memories (121A, 121B), said system
memory and said state cache memory,

and the step of performing a castout operation to update the line of memory within said shared memory and
assigning a data cache memory line state of SHARED to said line of memory in each data cache memory is
carried out if said system memory line state is OWNED BUS A or OWNED BUS B.

3. A method as claimed in claim 2, further comprising:

performing an invalidate operation to assign a data cache memory line state of INVALID to said line of memory
in each data cache memory (121A, 121B) connected to said first memory bus (BUS A) or said second memory
bus (BUS B) if said system memory line state is SHARED BOTH.
